An Enhanced Approach to Handle Missing Values in Heterogeneous Dataset
Abstract
Data mining (also called knowledge discovery or knowledge extraction) is the process of analyzing large volumes of data from different perspectives and summarizing it into useful information. Data quality is therefore essential to obtaining high-quality patterns: quality decisions ought to be based on quality data. Data quality is degraded by missing values, called holes, which arise for various reasons. A variety of imputation methods have been developed to complete a database by filling the holes with plausible values, but they are limited to handling missing values in homogeneous attributes. A few existing systems use a mixture kernel function to impute missing values in mixed-attribute datasets. This work proposes a new imputation framework for handling missing values in heterogeneous datasets. Pre-imputation is first performed using the ENI (Encapsidated Neighbour Imputation) approach, followed by application of a Gaussian kernel function to both continuous and discrete attributes. The proposed framework is compared with its competitors at various standard missing rates on benchmark datasets from the UCI repository. Its behaviour is studied using the RMSE measure, and the authors conclude that it performs well.
Keywords: Data Mining, Missing Value Imputation, Kernel
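The abstract names two ingredients that a short example can make concrete: Gaussian-kernel weighting and RMSE-based evaluation. The Python sketch below is illustrative only and is not the authors' framework. It omits the ENI pre-imputation step, which the abstract does not specify, and it assumes that discrete attributes are integer-coded and that a single bandwidth is shared by all attributes. The names `gaussian_kernel`, `kernel_impute`, `rmse`, and the `bandwidth` parameter are hypothetical.

```python
import math


def gaussian_kernel(u: float) -> float:
    """Standard Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_impute(complete_rows, query, target_idx, bandwidth=1.0):
    """Estimate query[target_idx] as a kernel-weighted mean over complete rows.

    Hypothetical helper (an assumption, not the paper's method): each complete
    row is weighted by the product of Gaussian kernels over the non-target
    attributes; discrete attributes are assumed to be integer-coded.
    """
    num = 0.0
    den = 0.0
    for row in complete_rows:
        weight = 1.0
        for j, (a, b) in enumerate(zip(row, query)):
            if j == target_idx:
                continue  # the target attribute is the one being imputed
            weight *= gaussian_kernel((a - b) / bandwidth)
        num += weight * row[target_idx]
        den += weight
    if den == 0.0:
        # Every weight underflowed to zero: fall back to the plain column mean.
        return sum(r[target_idx] for r in complete_rows) / len(complete_rows)
    return num / den


def rmse(estimates, truths):
    """Root mean squared error between imputed and true values."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Toy rows: (continuous attribute, integer-coded discrete attribute, target).
    rows = [(1.0, 0, 10.0), (2.0, 0, 20.0), (3.0, 1, 30.0)]
    query = (2.0, 0, None)  # the third attribute is the hole to fill
    estimate = kernel_impute(rows, query, target_idx=2)
    print(estimate, rmse([estimate], [20.0]))
```

Under these assumptions, rows that are closer to the incomplete record on its observed attributes contribute more to the imputed value. A lower RMSE against held-out true values indicates a better imputation.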
Similar resources
Investigating the missing data effect on credit scoring rule based models: The case of an Iranian bank
Credit risk management is a process in which banks estimate the probability of default (PD) for each loan applicant. Data sets of previous loan applicants are built by gathering their data; these internal data sets are usually completed using external credit bureau data and finally used for estimating PD in banks. Banks also have a continuing interest in using rule-based classifiers to b...
A Framework for Optimal Attribute Evaluation and Selection in Hesitant Fuzzy Environment Based on Enhanced Ordered Weighted Entropy Approach for Medical Dataset
Background: In this paper, a generic hesitant fuzzy set (HFS) model for clustering various ECG beats according to weights of attributes is proposed. A comprehensive review of electrocardiogram signal classification and segmentation methodologies indicates that algorithms able to handle the nonstationarity and uncertainty of the signals effectively should be used for ECG analysis. Ex...
Multiple imputation for national public-use datasets and its possible application for gestational age in United States Natality files.
Multiple imputation (MI) is a technique that can be used for handling missing data in a public-use dataset. With MI, two or more completed versions of the dataset are created, containing possibly different but reasonable replacements for the missing data. Users analyse the completed datasets separately with standard techniques and then combine the results using simple formulae in a way that all...
A Comparative Study on Decision Rule Induction for incomplete data using Rough Set and Random Tree Approaches
Handling missing attribute values is one of the most challenging tasks in data analysis. Many approaches can be adopted to handle missing attributes. In this paper, a comparative analysis is made of an incomplete dataset for future prediction using the rough set approach and random tree generation in data mining. The result of simple classification technique (using random tree ...
A BAYESIAN APPROACH TO COMPUTING MISSING REGRESSOR VALUES
In this article, Lindley's measure of average information is used to measure the information contained in incomplete observations on the vector of unknown regression coefficients [9]. This measure of information may be used to compute the missing regressor values.
Publication date: 2014